Add a faster string-search method with simd instructions than std::find #10858

skadilover · 2024-08-27T12:42:38Z

Description

In practice, c_strstr is faster than std::find in most cases, see [1]. However, the c_strstr interface stops at the first '\0', which makes it unable to handle unicode characters. In addition, glibc uses two-way search, see [2], but it does not seem to be exposed as a common interface, and it requires some additional code to determine the best path. In some scenarios, this decision-making code may cause performance degradation.

Solution

There are many optimized string-search algorithms, see [7], but they cannot change the essence of the problem, which is a search process of O(n*m) complexity. I think the most common way to improve performance is to optimize the search instructions, as doris does, see [5] (the same is true for clickhouse: this part of the doris code was copied from clickhouse, with the Volnitsky [11] part removed).
This PR refers to the implementations in [5], [6], and [8]:
[5] doris: compares the first two chars first; the page-safety issue is also mentioned there
[6] stringzilla: compares the first, middle, and last chars first
[8] simd-strstr, implemented by WojciechMula: compares the first and last chars first

They all check a few chars first. The more chars you check up front, the higher the probability of hitting the optimized path, but the higher the cost of the optimized path itself.

In the end, I chose to compare the first and last chars first and then compare the remaining bytes. This is because, in practice, our users seem to be able to get the matching results through the first and last chars.
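
As an illustration of the first/last filter described above, here is a minimal portable sketch. It is not the Velox simdStrStr: `firstLastFind` and `kBlock` are hypothetical names, and the candidate mask is built with a scalar loop where a SIMD version would use two vector compares and a movemask.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper: scalar emulation of the first/last-character filter.
inline std::size_t firstLastFind(
    std::string_view haystack,
    std::string_view needle) {
  constexpr std::size_t kBlock = 16; // stands in for the vector width
  const std::size_t n = haystack.size();
  const std::size_t k = needle.size();
  if (k == 0) {
    return 0;
  }
  if (k > n) {
    return std::string_view::npos;
  }
  const char first = needle.front();
  const char last = needle.back();
  const std::size_t candidates = n - k + 1; // number of valid start offsets
  for (std::size_t i = 0; i < candidates; i += kBlock) {
    // Bit j is set when both the first and the last char match at i + j.
    std::uint32_t mask = 0;
    for (std::size_t j = 0; j < kBlock && i + j < candidates; ++j) {
      if (haystack[i + j] == first && haystack[i + j + k - 1] == last) {
        mask |= 1u << j;
      }
    }
    // Only candidates that pass the filter pay for the full comparison.
    for (std::size_t j = 0; mask != 0; ++j, mask >>= 1) {
      if ((mask & 1u) != 0 &&
          std::memcmp(haystack.data() + i + j, needle.data(), k) == 0) {
        return i + j;
      }
    }
  }
  return std::string_view::npos;
}
```

Because the lengths are passed explicitly, the sketch also handles inputs with embedded '\0' bytes, which is the limitation of c_strstr mentioned in the description.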

Unit tests & benchmark
This part of the code is based on folly's test code, see [9] and [10], with some adjustments for this implementation.

Benchmark: std::find (std_*) vs simdStrStr (opt_*):

                                          time/iter   iters/s
findSuccessful(opt_50_5)                    31.66ms     31.59
findSuccessful(opt_100_10)                  36.81ms     27.16
findSuccessful(opt_100_20)                  96.47ms     10.37
findSuccessful(opt_1k_10)                   63.27ms     15.81
findSuccessful(opt_1k_100)                 156.95ms      6.37
findSuccessful(std_50_5)                    45.24ms     22.11
findSuccessful(std_100_10)                 191.23ms      5.23
findSuccessful(std_100_20)                 322.11ms      3.10
findSuccessful(std_1k_10)                  122.69ms      8.15
findSuccessful(std_1k_100)                 168.10ms      5.95
findUnsuccessful(std_first_char_match)     819.13ms      1.22
findUnsuccessful(opt_first_char_match)     261.46ms      3.82
findUnsuccessful(std_first_char_unmatch)   249.12ms      4.01
findUnsuccessful(opt_first_char_unmatch)   199.02ms      5.02

References

skadilover · 2024-08-29T03:23:57Z

@Yuhta
ping~

from #10731

> I don't think you can put variable definition in header file, you probably need to move it to SimdUtil.cpp and have only forward declaration here. Also what is the value if we do MADV_HUGEPAGE for some addresses?

I found that performance drops by 30% when the definition is placed in the cpp file. So for this case, I changed the code like this:

  static const int kPageSize = sysconf(_SC_PAGESIZE);

====>
FOLLY_ALWAYS_INLINE bool pageSafe(const void* const ptr, size_t length) {
  static const int kPageSize = sysconf(_SC_PAGESIZE);
  return ((kPageSize - 1) & reinterpret_cast<std::uintptr_t>(ptr)) <=
      kPageSize - CharVector::size - length;
}
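
To make the arithmetic behind such a page-safety check easier to follow, here is a stand-alone sketch. `loadStaysInPage` is a hypothetical name, and the page size and load width are passed as parameters (instead of coming from sysconf and CharVector::size) so the behaviour can be checked in isolation. The comparison is written as a sum so that it does not depend on a subtraction of unsigned values staying non-negative.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone variant of a page-safety check.
// pageSize is assumed to be a power of two.
constexpr bool loadStaysInPage(
    std::uintptr_t addr,
    std::size_t loadBytes,
    std::size_t pageSize) {
  // Offset of addr within its page.
  const std::size_t offset = addr & (pageSize - 1);
  // The load covers [offset, offset + loadBytes) and must not pass pageSize.
  return offset + loadBytes <= pageSize;
}
```

With a 4096-byte page and a 16-byte load, offset 4080 is the last safe start; a load starting at offset 4081 would touch the next page.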

> MADV_HUGEPAGE for some addresses

I haven't found a good way to handle this situation; could you give me some advice?
Could we assume that the MADV_HUGEPAGE page size is always a multiple of _SC_PAGESIZE (2M vs 4k, etc.)? See https://hugekernel.org/doc/Documentation/admin-guide/mm/transhuge.rst

> Modern kernels support "multi-size THP" (mTHP), which introduces the
> ability to allocate memory in blocks that are bigger than a base page
> but smaller than traditional PMD-size (as described above), in
> increments of a power-of-2 number of pages. mTHP can back anonymous
> memory (for example 16K, 32K, 64K, etc)

Red Hat defines the valid values:
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-transhuge#s-memory-configure_hugepages

> Defines the size of persistent huge pages configured in the kernel at boot time. Valid values are 2 MB and 1 GB. The default value is 2 MB.
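
If the assumption above holds (huge-page sizes are power-of-two multiples of the base page size), every huge-page boundary is also a base-page boundary, so a check against the base page size would be conservative: it may reject a load that is actually safe, but it should not accept one that crosses a huge-page boundary. The sketch below checks this by brute force; all names are hypothetical and the sizes are example values, not something read from the kernel.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper; pageSize is assumed to be a power of two.
constexpr bool staysInPage(
    std::uintptr_t addr,
    std::size_t loadBytes,
    std::size_t pageSize) {
  return (addr & (pageSize - 1)) + loadBytes <= pageSize;
}

// Returns true if, for every address inside one huge page, passing the
// base-page check implies passing the huge-page check.
inline bool baseCheckIsConservative(
    std::size_t basePage,
    std::size_t hugePage,
    std::size_t loadBytes) {
  for (std::uintptr_t addr = 0; addr < hugePage; ++addr) {
    if (staysInPage(addr, loadBytes, basePage) &&
        !staysInPage(addr, loadBytes, hugePage)) {
      return false;
    }
  }
  return true;
}
```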

Yuhta

Please consider the other suggestions in #10731 (review)

velox/common/base/SimdUtil-inl.h

skadilover · 2024-09-09T03:06:12Z

@Yuhta

> Please consider the other suggestions in #10731 (review)

Updated the code for the suggestions in #10731:

> These checks can be VELOX_DCHECK since their correctness can be ensured quite locally in the same file

done

> VELOX_SIMD_STRSTR_CASE

done

> aside: Wondering what is the minimal k to make KMP faster than this. This is also related to content of the strings though.

I have extended StringBenchmark to support different algorithms. KMP is always slower than simdStrStr (maybe my implementation is not good, so I added std::boyer_moore_searcher for comparison):

findSuccessful(simd_50_to_5)                              344.39ms      2.90
findSuccessful(std_50_to_5)                               802.61ms      1.25
findSuccessful(std_boyer_moore_50_to_5)                      1.42s   703.66m
findSuccessful(kmp_50_to_5)                                  5.23s   191.05m
findSuccessful(simd_100_to_10)                            315.82ms      3.17
findSuccessful(std_100_to_10)                                1.26s   790.97m
findSuccessful(std_boyer_moore_100_to_10)                 984.90ms      1.02
findSuccessful(kmp_100_to_10)                                6.38s   156.70m
findSuccessful(simd_100_to_20)                            352.90ms      2.83
findSuccessful(std_100_to_20)                                2.03s   492.18m
findSuccessful(std_boyer_moore_100_to_20)                 887.87ms      1.13
findSuccessful(kmp_100_to_20)                                6.16s   162.34m
findSuccessful(simd_1000_to_10)                           254.53ms      3.93
findSuccessful(std_1000_to_10)                            698.50ms      1.43
findSuccessful(std_boyer_moore_1000_to_10)                683.83ms      1.46
findSuccessful(kmp_1000_to_10)                               3.33s   300.39m
findSuccessful(simd_1000_to_100)                           65.42ms     15.29
findSuccessful(std_1000_to_100)                           186.01ms      5.38
findSuccessful(std_boyer_moore_1000_to_100)               381.65ms      2.62
findSuccessful(kmp_1000_to_100)                           913.63ms      1.09
findSuccessful(simd_1000_to_200)                          306.99ms      3.26
findSuccessful(std_1000_to_200)                           642.04ms      1.56
findSuccessful(std_boyer_moore_1000_to_200)               683.16ms      1.46
findSuccessful(kmp_1000_to_200)                              9.70s   103.05m
findUnsuccessful(std_first_char_match)                    400.47ms      2.50
findUnsuccessful(opt_first_char_match)                    266.45ms      3.75
findUnsuccessful(std_first_char_unmatch)                  333.54ms      3.00
findUnsuccessful(opt_first_char_unmatch)                  284.63ms      3.51

| | simd strstr | std find | std boyer moore | kmp |
|---|---|---|---|---|
| haystack: 50, needle: 5 | 344ms | 802ms | 1.42s | 5.23s |
| haystack: 100, needle: 10 | 315ms | 1.26s | 984ms | 6.38s |
| haystack: 100, needle: 20 | 352ms | 2.03s | 887ms | 6.16s |
| haystack: 1000, needle: 10 | 254ms | 698ms | 683ms | 3.33s |
| haystack: 1000, needle: 100 | 65ms | 186ms | 381ms | 913ms |
| haystack: 1000, needle: 200 | 306ms | 642ms | 683ms | 9.70s |
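
For readers unfamiliar with the std::boyer_moore_searcher baseline, this is roughly how it is driven through std::search (C++17). The wrapper name `boyerMooreFind` is hypothetical and is not the benchmark code from this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical wrapper: returns the offset of the first match, or npos.
inline std::size_t boyerMooreFind(
    std::string_view haystack,
    std::string_view needle) {
  if (needle.empty()) {
    return 0; // mirror std::string_view::find for an empty needle
  }
  const auto it = std::search(
      haystack.begin(),
      haystack.end(),
      std::boyer_moore_searcher(needle.begin(), needle.end()));
  return it == haystack.end()
      ? std::string_view::npos
      : static_cast<std::size_t>(it - haystack.begin());
}
```

The searcher builds its skip tables when it is constructed, so when the same needle is searched many times it is usually worth constructing the searcher once and reusing it.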

> Since this is an almost drop-in replacement for strstr, can you add another test to generate a few thousands of random string pairs with various lengths, and check our result is the same as strstr? Also make sure you cover corner cases like larger needle than haystack, etc.

I updated the unit tests, following folly's unit tests, and added random test cases.
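
A randomized comparison of the kind requested above could look like the sketch below. It is not the test added in this PR: `naiveFind` is a hypothetical stand-in for the function under test, and std::string_view::find is used as the reference instead of strstr so that explicit lengths are respected.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <string_view>

// Hypothetical stand-in for the function under test.
inline std::size_t naiveFind(std::string_view h, std::string_view n) {
  if (n.size() > h.size()) {
    return std::string_view::npos;
  }
  for (std::size_t i = 0; i + n.size() <= h.size(); ++i) {
    if (h.compare(i, n.size(), n) == 0) {
      return i;
    }
  }
  return std::string_view::npos;
}

// Counts disagreements with std::string_view::find over random pairs drawn
// from a three-letter alphabet, so that matches are reasonably frequent.
template <typename Find>
std::size_t countMismatches(Find find, std::size_t rounds, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> haystackLen(0, 200);
  std::uniform_int_distribution<std::size_t> needleLen(0, 8);
  std::uniform_int_distribution<int> letter(0, 2);
  auto randomString = [&](std::size_t len) {
    std::string s(len, 'a');
    for (auto& c : s) {
      c = static_cast<char>('a' + letter(rng));
    }
    return s;
  };
  std::size_t mismatches = 0;
  for (std::size_t r = 0; r < rounds; ++r) {
    const std::string h = randomString(haystackLen(rng));
    const std::string n = randomString(needleLen(rng));
    if (find(h, n) != std::string_view(h).find(n)) {
      ++mismatches;
    }
  }
  return mismatches;
}
```

The length ranges include empty strings and needles longer than the haystack, which covers the corner cases mentioned in the review comment.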

Could you have a look at this once more?

skadilover · 2024-09-09T15:17:04Z

@Yuhta ping~

Yuhta

The benchmark result looks very good

velox/common/base/SimdUtil-inl.h

skadilover · 2024-09-13T04:25:30Z

@Yuhta updated the code:

pageSize => global variable
fixed the comments above, except this one:

> Will we miss the opportunity if it is a super long string that is spanning multiple pages?

Since kPageSize is now inlined, the pageSafe method is faster than before, so I moved the check into each batch iteration. This way we only miss the comparisons that span pages, which may be better. This is similar to clickhouse:
https://github.com/ClickHouse/ClickHouse/blob/master/src/Common/StringSearcher.h

            while (haystack < haystack_end)
            {
                if (haystack + N <= haystack_end && isPageSafe(haystack))
                {

new benchmark result:

findSuccessful(simd_50_to_5)                              355.40ms      2.81
findSuccessful(std_50_to_5)                               807.73ms      1.24
findSuccessful(std_boyer_moore_50_to_5)                      1.40s   712.05m
findSuccessful(kmp_50_to_5)                                  5.18s   192.87m
findSuccessful(simd_100_to_10)                            319.76ms      3.13
findSuccessful(std_100_to_10)                                1.25s   802.85m
findSuccessful(std_boyer_moore_100_to_10)                 910.54ms      1.10
findSuccessful(kmp_100_to_10)                                6.38s   156.72m
findSuccessful(simd_100_to_20)                            355.43ms      2.81
findSuccessful(std_100_to_20)                                2.07s   483.83m
findSuccessful(std_boyer_moore_100_to_20)                 901.23ms      1.11
findSuccessful(kmp_100_to_20)                                6.20s   161.18m
findSuccessful(simd_1000_to_10)                           245.61ms      4.07
findSuccessful(std_1000_to_10)                            724.69ms      1.38
findSuccessful(std_boyer_moore_1000_to_10)                870.93ms      1.15
findSuccessful(kmp_1000_to_10)                               3.33s   300.34m
findSuccessful(simd_1000_to_100)                           69.67ms     14.35
findSuccessful(std_1000_to_100)                           186.06ms      5.37
findSuccessful(std_boyer_moore_1000_to_100)               286.45ms      3.49
findSuccessful(kmp_1000_to_100)                           913.97ms      1.09
findSuccessful(simd_1000_to_200)                          354.15ms      2.82
findSuccessful(std_1000_to_200)                           642.50ms      1.56
findSuccessful(std_boyer_moore_1000_to_200)               675.70ms      1.48
findSuccessful(kmp_1000_to_200)                              9.78s   102.28m
findUnsuccessful(std_first_char_match)                    920.16ms      1.09
findUnsuccessful(opt_first_char_match)                    383.93ms      2.60
findUnsuccessful(std_first_char_unmatch)                  334.28ms      2.99
findUnsuccessful(opt_first_char_unmatch)                  331.98ms      3.01

skadilover · 2024-09-19T02:34:07Z

@Yuhta ping~

Yuhta · 2024-09-20T16:41:09Z

velox/common/base/SimdUtil-inl.h

+  // may not generate a general-protection exception (#GP) in this situation,
+  // and the address that spans the end of the segment may or may not wrap
+  // around to the beginning of the segment.
+  for (; i <= n - needleSize && pageSafe(s + i) &&


If I understand this correctly, we will drop out of this fast path as soon as [s + i, s + i + needleSize) crosses the page boundary? So data in the subsequent pages would still go through the slow path, even if they are not crossing a page boundary anymore?

> So data in the subsequent pages would still go through the slow path, even if they are not crossing a page boundary anymore

Yes, you understand correctly. This may be better than putting the check outside the loop, because we can still check part of the data.

But I think we can still use the fast path for the second, and any subsequent pages? Why do we need to drop out early?

Let's detail the case here:

ref: Computer Systems: A Programmer's Perspective, section 9.3

> Four virtual pages (VP 1, VP 2, VP 4, and VP 7)
> are currently cached in DRAM. Two pages (VP 0 and VP 5) have not yet been
> allocated, and the rest (VP 3 and VP 6) have been allocated, but are not currently
> cached.

I think we only need to avoid over-reading at the end of VP 4, because VP 5 is not allocated yet. So we only need to check whether the end of the input string array is pageSafe.

> Will we miss the opportunity if it is a super long string that is spanning multiple pages?

And we only need to check the last pointer in the substring search:

if (i + needleSize + CharVector::size > n && !pageSafe(s + i + needleSize - 1)) { break; }

I think this way we only miss the last page, in the case where the end of the input string is right at the end of VP 4.
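
The condition above can be modelled on plain offsets to see when the fast path is actually abandoned. In this sketch `kVectorBytes` and `kPageBytes` are assumed example values (standing in for CharVector::size and the system page size), and `mustLeaveFastPath` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorBytes = 16; // assumed vector width
constexpr std::size_t kPageBytes = 4096; // assumed page size

// True when a kVectorBytes load at addr stays inside the page holding addr.
constexpr bool pageSafe(std::uintptr_t addr) {
  return (addr & (kPageBytes - 1)) + kVectorBytes <= kPageBytes;
}

// True when the loop must stop at offset i: the load at the needle's last
// char would run past the n input bytes starting at base, and that
// over-read would also cross into the next page.
constexpr bool mustLeaveFastPath(
    std::uintptr_t base,
    std::size_t n,
    std::size_t i,
    std::size_t needleSize) {
  return i + needleSize + kVectorBytes > n &&
      !pageSafe(base + i + needleSize - 1);
}
```

For an input that ends exactly at a page end (base 12288, n 4096, needleSize 4), the check only fires for the last few offsets such as i = 4080. For an input that ends mid-page (n 100, i 90), the over-read stays inside the same page and the fast path continues.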

@Yuhta update comment and code .

Yeah that's very clear now, thanks

skadilover · 2024-09-25T05:37:30Z

@Yuhta

Any more comments about this?

Yuhta · 2024-10-02T22:20:47Z

velox/common/base/SimdUtil-inl.h

+    // Assume that the input string is allocated on virtual pages : VP1, VP2,
+    // VP3 and VP4 has not been allocated yet, we need to check the end of input
+    // string is page-safe to over-read CharVector.
+    if (i + needleSize + CharVector::size > n &&


This condition can be loosened to i + needleSize - 1 + CharVector::size > n. Maybe this is clearer:

    const auto last = i + needleSize - 1;
    if (last + CharVector::size > n && !pageSafe(s + last)) {
      break;
    }
    // ...
    auto blockLast = CharVector::load_unaligned(s + last);
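
The difference between the two conditions is a single offset. A load of CharVector::size bytes at `last` reads [last, last + CharVector::size), so it only over-reads when last + CharVector::size > n. The sketch below compares both forms, with `kVectorBytes` as an assumed width of 16 and hypothetical function names.

```cpp
#include <cstddef>

constexpr std::size_t kVectorBytes = 16; // assumed vector width

// Condition as written in the diff: fires one offset too early.
constexpr bool originalNeedsCheck(
    std::size_t i,
    std::size_t needleSize,
    std::size_t n) {
  return i + needleSize + kVectorBytes > n;
}

// Loosened condition: fires only when the load at `last` passes the end.
constexpr bool loosenedNeedsCheck(
    std::size_t i,
    std::size_t needleSize,
    std::size_t n) {
  const std::size_t last = i + needleSize - 1;
  return last + kVectorBytes > n;
}
```

With n = 32 and needleSize = 4, offset i = 13 gives last = 16, so the load covers bytes 16 to 31 and stays inside the input. The original condition still asks for a page check there, while the loosened one does not.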

facebook-github-bot added the CLA Signed label Aug 27, 2024

Yuhta self-requested a review August 27, 2024 15:16

skadilover force-pushed the opt_strstr branch 7 times, most recently from e99e699 to 457860e August 29, 2024 02:30

skadilover changed the title ~~Add a fast simd strstr~~ Add a fast string-search method named simdStrStr Aug 29, 2024

skadilover changed the title ~~Add a fast string-search method named simdStrStr~~ Add a fast string-search method with simd instructions Aug 29, 2024

skadilover changed the title ~~Add a fast string-search method with simd instructions~~ Add a faster string-search method with simd instructions than std::find Aug 29, 2024

skadilover force-pushed the opt_strstr branch 3 times, most recently from 991be1a to 54422e3 August 30, 2024 02:14

Yuhta reviewed Aug 30, 2024

skadilover force-pushed the opt_strstr branch from 54422e3 to 2cf06d7 September 9, 2024 03:00

skadilover force-pushed the opt_strstr branch 4 times, most recently from 58a2c79 to 3ae8be1 September 10, 2024 05:56

skadilover mentioned this pull request Sep 12, 2024: Add substrings search path for constant like pattern #10731 (Closed)

skadilover requested a review from Yuhta September 12, 2024 03:06

Yuhta reviewed Sep 12, 2024

skadilover force-pushed the opt_strstr branch from 3ae8be1 to 13220d3 September 13, 2024 03:59

skadilover force-pushed the opt_strstr branch from 13220d3 to 40a8df4 September 13, 2024 04:26

skadilover requested a review from Yuhta September 13, 2024 04:28

skadilover force-pushed the opt_strstr branch 2 times, most recently from 0e2a09d to b6c6706 September 13, 2024 14:00

Yuhta reviewed Sep 20, 2024

skadilover force-pushed the opt_strstr branch 2 times, most recently from 7aa4d26 to 446f09d September 26, 2024 03:04

Yuhta reviewed Oct 2, 2024

skadilover added 3 commits October 8, 2024 09:59

add simd strstr (71d8835)

add ut / benchmark (a61a454)

check the last page (cb6e208)

skadilover force-pushed the opt_strstr branch from 446f09d to cb6e208 October 8, 2024 02:00

skadilover requested a review from Yuhta October 8, 2024 02:00
